Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes #3588

Closed
wants to merge 6 commits into from

Conversation

jkbradley
Copy link
Member

Documentation:

  • Added ml-guide.md, linked from mllib-guide.md
  • Updated mllib-guide.md with small section pointing to ml-guide.md

Examples:

  • CrossValidatorExample
  • SimpleParamsExample
  • (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

Bug fixes:

  • PipelineModel: did not use ParamMaps correctly
  • UnaryTransformer: issues with TypeTag serialization (Thanks to @mengxr for that fix!)

CC: @mengxr @shivaram @etrain Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.

jkbradley and others added 5 commits December 2, 2014 14:59
…sValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
replace TypeTag with explicit datatype
…rossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
@SparkQA
Copy link

SparkQA commented Dec 3, 2014

Test build #24103 has finished for PR 3588 at commit c38469c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaCrossValidatorExample
    • public class JavaSimpleParamsExample

@jkbradley jkbradley changed the title [SPARK-4575] [mllib] spark.ml pipelines doc + bug fixes [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes Dec 4, 2014
transformSchema(dataset.schema, paramMap, logging = true)
stages.foldLeft(dataset)((cur, transformer) => transformer.transform(cur, paramMap))
// Precedence of ParamMaps: paramMap > this.paramMap > fittingParamMap
val map = (fittingParamMap ++ this.paramMap) ++ fittingParamMap
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite get the logic here. this.paramMap contains only parameters to the Pipeline instance and the input paramMap is not included. fittingParamMap is for record purpose. All relevant parameters should be inherited from the parent algorithm to model.paramMap.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I think this was the wrong way to fix the bug.

The bug was:
The model was training with hashingTF.numFeatures features (10, 100 or 1000), but when PipelineModel.transform() was called, HashingTF used the default numFeatures.

I probably should fix it by changing Pipeline.fit() to merge all relevant parameters from the paramMap passed to fit() into the transformers. (The params are stored in Models but not other Transformers right now.) I'll make that change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I see the issue now. Could we create a separate PR? This is not a blocker for the release, and it might need some discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think it was just a typo. The comment is correct. It should be:

val map = (fittingParamMap ++ this.paramMap) ++ paramMap

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, thanks!

@jkbradley
Copy link
Member Author

@mengxr Thanks for reviewing! Updated based on all comments.

@SparkQA
Copy link

SparkQA commented Dec 4, 2014

Test build #24129 has finished for PR 3588 at commit d393b5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaCrossValidatorExample
    • public class JavaSimpleParamsExample

@asfgit asfgit closed this in 469a6e5 Dec 4, 2014
asfgit pushed a commit that referenced this pull request Dec 4, 2014
Documentation:
* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md

Examples:
* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

Bug fixes:
* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)

CC: mengxr shivaram  etrain  Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:

d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit).  updated examples for CV and Params for spark.ml
c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params.  Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version.  CrossValidatorExample not working yet.  Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.

(cherry picked from commit 469a6e5)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@mengxr
Copy link
Contributor

mengxr commented Dec 4, 2014

LGTM. Merged into master and branch-1.2. Thanks a lot!!

@jkbradley jkbradley deleted the ml-package-docs branch July 25, 2016 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants